A-Malik03_OriginalHomeworkCode_03

Some of my best friends are Zombies…


Reflection

I enjoyed most of this homework, but some of it was a struggle. I really enjoyed figuring out how to write my own functions, despite how frustrating it was. I struggled with question 6 because I misread it: I was trying to store all 100 x 30 data points in a vector of vectors before realizing I only needed the mean of each random sample. I also struggled to find a way to add elements to a vector; I searched online and found the function “append”. Finally, I needed to refresh my memory on how to read Q-Q plots, so I reread Module 8 to get a better understanding of quantile-quantile plots and their uses. I didn’t use the full capabilities of ggplot2 because I wanted to make sure I did the homework correctly first. I plan on adding color and polish for the final submission.

1. Calculate the population mean and standard deviation for each quantitative random variable (height, weight, age, number of zombies killed, and years of education).

library(curl)
## Using libcurl 7.79.1 with LibreSSL/3.3.6
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter()     masks stats::filter()
## ✖ dplyr::lag()        masks stats::lag()
## ✖ readr::parse_date() masks curl::parse_date()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
f <- curl("https://raw.githubusercontent.com/fuzzyatelin/fuzzyatelin.github.io/master/AN588_Fall23/zombies.csv")
d <- read.csv(f, header = TRUE, sep = ",", stringsAsFactors = FALSE)
head(d) #make sure it's set up right
##   id first_name last_name gender   height   weight zombies_killed
## 1  1      Sarah    Little Female 62.88951 132.0872              2
## 2  2       Mark    Duncan   Male 67.80277 146.3753              5
## 3  3    Brandon     Perez   Male 72.12908 152.9370              1
## 4  4      Roger   Coleman   Male 66.78484 129.7418              5
## 5  5      Tammy    Powell Female 64.71832 132.4265              4
## 6  6    Anthony     Green   Male 71.24326 152.5246              1
##   years_of_education                           major      age
## 1                  1                medicine/nursing 17.64275
## 2                  3 criminal justice administration 22.58951
## 3                  1                       education 21.91276
## 4                  6                  energy studies 18.19058
## 5                  3                       logistics 21.10399
## 6                  4                  energy studies 21.48355
(mean.height <- mean(d$height))
## [1] 67.6301
(mean.weight <- mean(d$weight))
## [1] 143.9075
(mean.age <- mean(d$age))
(mean.kills <- mean(d$zombies_killed))
## [1] 2.992
(mean.edu <- mean(d$years_of_education))
## [1] 2.996
pop.sd <- function(x){ #create a function for population SD
  sqrt(
    sum(
      (x - mean(x))^2
    )
    / length(x)
  )
}
(height.sd <- pop.sd(d$height))
## [1] 4.30797
(weight.sd <- pop.sd(d$weight))
## [1] 18.39186
(age.sd <- pop.sd(d$age))
## [1] 2.963583
(kills.sd <- pop.sd(d$zombies_killed))
## [1] 1.747551
(edu.sd <- pop.sd(d$years_of_education))
## [1] 1.675704
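
As a quick check on `pop.sd`, the sketch below compares it with base R's `sd()`, which divides by n - 1 rather than n. The vector `x` is made-up illustration data, not the zombie data, and the variable names are only for this example.

```r
# Population SD (divide by n), same definition as pop.sd above
pop.sd <- function(x) {
  sqrt(sum((x - mean(x))^2) / length(x))
}

x <- c(2, 4, 4, 4, 5, 5, 7, 9)  # hypothetical values for illustration
n <- length(x)

pop_sd_x <- pop.sd(x)                       # divides by n
rescaled_sd_x <- sd(x) * sqrt((n - 1) / n)  # sd() divides by n - 1, so rescale it

# TRUE when the two calculations agree
all.equal(pop_sd_x, rescaled_sd_x)
```

For a population of 1000 the two versions are nearly identical, but for a sample of 30 the difference is large enough to matter.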

2. Use {ggplot} to make boxplots of each of these variables by gender.

heightplot <- ggplot(data = d, aes(x = gender, y = height)) + geom_boxplot() + xlab("Gender") + ylab("Height (inches)")
heightplot

weightplot <- ggplot(data = d, aes(x = gender, y = weight)) + geom_boxplot() + xlab("Gender") + ylab("Weight (pounds)")
weightplot

ageplot <- ggplot(data = d, aes(x = gender, y = age)) + geom_boxplot() + xlab("Gender") + ylab("Age (years)")
ageplot

killsplot <- ggplot(data = d, aes(x = gender, y = zombies_killed)) + geom_boxplot() + xlab("Gender") + ylab("Zombies Killed")
killsplot

eduplot <- ggplot(data = d, aes(x = gender, y = years_of_education)) + geom_boxplot() + xlab("Gender") + ylab("Years of Education")
eduplot

4. Using histograms and Q-Q plots, check whether the quantitative variables seem to be drawn from a normal distribution. Which seem to be and which do not (hint: not all are drawn from the normal distribution)? For those that are not normal, can you determine from which common distribution they are drawn?

par(mfrow = c(1,2))
hist(d$height) #Plots a basic histogram
qqnorm(d$height, frame = FALSE) #Plots a qqgraph of data points
qqline(d$height) #plots a qq line 


Yes, based on the histogram and the Q-Q plot, the heights of the survivors appear to be drawn from a normal distribution.

par(mfrow = c(1,2))
hist(d$weight)
qqnorm(d$weight, frame = FALSE)
qqline(d$weight)


Yes, based on the histogram and the Q-Q plot, the weights of the survivors appear to be drawn from a normal distribution.

par(mfrow = c(1,2))
hist(d$age)
qqnorm(d$age, frame = FALSE)
qqline(d$age)


Yes, based on the histogram and the Q-Q plot, the ages of the survivors appear to be drawn from a normal distribution.

par(mfrow = c(1,2))
hist(d$zombies_killed)
qqnorm(d$zombies_killed, frame = FALSE)
qqline(d$zombies_killed)


No, based on the graphs, the data do not come from a normal distribution. The number of zombies killed by survivors seems to be drawn from a right-skewed distribution.
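
One right-skewed candidate that is often checked for count data is the Poisson distribution. That is an assumption for this sketch, not something the plots above establish. The code uses simulated counts (`kills`, with a hypothetical rate of 3) as a stand-in for `d$zombies_killed` and compares the observed proportions with Poisson probabilities.

```r
set.seed(1)                        # arbitrary seed so the sketch is reproducible
kills <- rpois(1000, lambda = 3)   # simulated stand-in for d$zombies_killed

# For a Poisson variable the mean and variance should be roughly equal
c(mean = mean(kills), variance = var(kills))

# Observed proportion of each count vs. the Poisson probability at the sample mean
counts <- 0:max(kills)
observed <- as.numeric(table(factor(kills, levels = counts))) / length(kills)
expected <- dpois(counts, lambda = mean(kills))
round(cbind(count = counts, observed = observed, expected = expected), 3)
```

If the observed and expected columns track each other closely on the real data, a Poisson distribution would be a reasonable description; large gaps would point to some other skewed distribution.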

par(mfrow = c(1,2))
hist(d$years_of_education)
qqnorm(d$years_of_education, frame = FALSE)
qqline(d$years_of_education)


No, the data for survivors’ years of education are not drawn from a normal distribution. The data look heavily right-skewed.

5. Now use the sample() function to sample ONE subset of 30 zombie survivors (without replacement) from this population and calculate the mean and sample standard deviation for each variable. Also estimate the standard error for each variable, and construct the 95% confidence interval for each mean. Note that for the variables that are not drawn from the normal distribution, you may need to base your estimate of the CIs on slightly different code than for the normal…

(samplesized <- sample_n(d, size = 30, replace = FALSE))
##     id first_name last_name    gender   height   weight zombies_killed
## 1  461       Mark     Allen      Male 76.52642 157.7459              7
## 2  183    Douglas    Miller      Male 68.11586 154.0981              1
## 3  210  Stephanie    Watson    Female 66.71065 130.7163              2
## 4  973    Shirley   Bradley    Female 59.14821 112.2698              2
## 5  400    Melissa Alexander    Female 61.40602 116.9725              6
## 6  159      James   Mendoza NonBinary 68.33511 155.5868              4
## 7  238     Dennis     Meyer      Male 75.85361 188.9251              3
## 8  970     Eugene  Ferguson      Male 60.32218 104.5811              5
## 9  174       Mary    Fowler    Female 65.08253 113.5358              3
## 10 453      Brian   Simpson      Male 65.91163 128.1662              2
## 11 326       Mary      Bell    Female 73.62638 162.0133              3
## 12 456   Patricia     Wells    Female 62.54020 127.7218              2
## 13 373      Roger     Clark      Male 70.07439 147.3089              1
## 14 357     Andrew  Hamilton      Male 68.54200 144.7081              4
## 15 356      Larry   Simpson      Male 70.42895 134.4735              2
## 16 116      James   Garrett      Male 65.53651 137.9662              3
## 17 935     Nicole      Cook    Female 65.41633 137.3064              1
## 18 548 Jacqueline    Barnes    Female 69.25874 150.9086              3
## 19   1      Sarah    Little    Female 62.88951 132.0872              2
## 20 607       Jane      Ross    Female 65.21890 131.3038              2
## 21 520      Julie  Crawford    Female 66.06300 139.6257              4
## 22  88    Phyllis   Wheeler    Female 58.74318 109.6111              2
## 23 301   Clarence   Edwards      Male 68.39093 167.3073              3
## 24 353     Teresa      King    Female 72.10447 168.9203              1
## 25 351       Ruby   Kennedy NonBinary 64.59859 132.5271              4
## 26 951     Donald  Campbell      Male 77.45735 187.0345              0
## 27 728    Rebecca     Davis    Female 61.78157 139.1198              3
## 28 388    Carolyn   Hawkins    Female 61.45906 128.2554              8
## 29 772   Michelle    Holmes    Female 65.66844 132.6530              3
## 30 632       Mary     Henry    Female 64.06442 129.2723              4
##    years_of_education                  major      age
## 1                   4           architecture 27.68187
## 2                   4          communication 20.04527
## 3                   1          communication 20.79042
## 4                   2 mechanical engineering 17.89542
## 5                   3  agricultural sciences 20.89453
## 6                   2       applied sciences 17.36646
## 7                   4  environmental science 24.17555
## 8                   3       animal husbandry 18.61261
## 9                   2                biology 19.06514
## 10                  3             philosophy 18.05536
## 11                  3           architecture 25.36748
## 12                  3       applied sciences 17.31055
## 13                  5          city planning 18.59596
## 14                  3                 botany 18.95085
## 15                  4          city planning 23.05967
## 16                  1              education 15.39994
## 17                  4              economics 18.93918
## 18                  1           architecture 18.65582
## 19                  1       medicine/nursing 17.64275
## 20                  3                 botany 21.58286
## 21                  7           architecture 19.87794
## 22                  3          communication 16.47283
## 23                  4     physical education 16.51631
## 24                  7         human services 22.25385
## 25                  4             psychology 21.93930
## 26                  5           epidemiology 23.84402
## 27                  3     physical education 16.38548
## 28                  1       animal husbandry 17.07809
## 29                  2           pharmacology 19.54634
## 30                  2             psychology 18.18894
(mean.height.s <- mean(samplesized$height))
## [1] 66.70917
(mean.weight.s <- mean(samplesized$weight))
## [1] 140.0907
(mean.age.s <- mean(samplesized$age))
## [1] 19.73969
(mean.kills.s <- mean(samplesized$zombies_killed))
## [1] 3
(mean.edu.s <- mean(samplesized$years_of_education))
## [1] 3.133333
(sd.height.s <- pop.sd(samplesized$height))
## [1] 4.864182
(sd.weight.s <- pop.sd(samplesized$weight))
## [1] 20.59288
(sd.age.s <- pop.sd(samplesized$age))
## [1] 2.873525
(sd.kills.s <- pop.sd(samplesized$zombies_killed))
## [1] 1.75119
(sd.edu.s <- pop.sd(samplesized$years_of_education))
## [1] 1.543445
standarderror <- function(sd) {
  sd/sqrt(30)
}

(standarderror(sd.height.s))
## [1] 0.888074
(standarderror(sd.weight.s))
## [1] 3.759729
(standarderror(sd.age.s))
## [1] 0.5246315
(standarderror(sd.kills.s))
## [1] 0.3197221
(standarderror(sd.edu.s))
## [1] 0.2817932
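
The question asks for the sample standard deviation, which base R's `sd()` computes with n - 1 in the denominator. A minimal sketch of a standard error helper built on it is below; `se.mean` is a hypothetical name, and the vector `x` is made-up data rather than the sample drawn above.

```r
# Hypothetical helper: standard error computed from the raw values
se.mean <- function(x) {
  sd(x) / sqrt(length(x))   # sd() uses n - 1; length(x) replaces the fixed 30
}

x <- c(62, 65, 68, 70, 71, 66, 64, 69)  # made-up heights for illustration
se.mean(x)
```

Taking the raw vector instead of a precomputed SD means the same function works for any sample size.
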
bounds <- function(xbar, sd) { #95% CI for the mean of a sample of 30, based on the t distribution
  t.score <- qt(p = 0.05/2, df = 29, lower.tail = FALSE) #critical value with df = n - 1
  lower.bound <- xbar - (t.score * sd/sqrt(30))
  upper.bound <- xbar + (t.score * sd/sqrt(30))
  print(c(lower.bound, upper.bound))
}
  
bounds(mean.height.s, sd.height.s)
## [1] 64.89286 68.52549
bounds(mean.weight.s, sd.weight.s)
## [1] 132.4012 147.7802
bounds(mean.age.s, sd.age.s)
## [1] 18.66670 20.81268
bounds(mean.kills.s, sd.kills.s)
## [1] 2.346095 3.653905
bounds(mean.edu.s, sd.edu.s)
## [1] 2.557002 3.709665
#Used this website to create bounds function: https://bookdown.org/logan_kelly/r_practice/p09.html
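
For the variables that do not look normal, one alternative to the t-based interval is a percentile bootstrap. This is a swapped-in technique, not the method used above. The sketch below uses simulated counts `s` as a stand-in for `samplesized$zombies_killed`; the seed, the rate of 3, and the 10000 resamples are arbitrary choices.

```r
set.seed(42)                 # arbitrary seed so the sketch is reproducible
s <- rpois(30, lambda = 3)   # simulated stand-in for samplesized$zombies_killed

# Resample the 30 values with replacement many times, keeping each resample's mean
boot.means <- replicate(10000, mean(sample(s, size = length(s), replace = TRUE)))

# Percentile 95% CI: the middle 95% of the bootstrap means
(boot.ci <- quantile(boot.means, probs = c(0.025, 0.975)))
```

Because the interval comes from the resampled means themselves, it does not assume the underlying variable is normal.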

6. Now draw 99 more random samples of 30 zombie apocalypse survivors, and calculate the mean for each variable for each of these samples. Together with the first sample you drew, you now have a set of 100 means for each variable (each based on 30 observations), which constitutes a sampling distribution for each variable. What are the means and standard deviations of this distribution of means for each variable? How do the standard deviations of means compare to the standard errors estimated in [5]? What do these sampling distributions look like (a graph might help here)? Are they normally distributed? What about for those variables that you concluded were not originally drawn from a normal distribution?

meanheights <- c(mean.height.s)
meanweights <- c(mean.weight.s)
meanages <- c(mean.age.s)
meankills <- c(mean.kills.s)
meanedus <- c(mean.edu.s)



for (x in 1:99) { #draw 99 more samples of 30 and append each sample's means to the vectors of means
  rsample <- sample_n(d, size = 30, replace = FALSE)
  meanheights <- append(meanheights, mean(rsample$height))
  meanweights <- append(meanweights, mean(rsample$weight))
  meanages <- append(meanages, mean(rsample$age))
  meankills <- append(meankills, mean(rsample$zombies_killed))
  meanedus <- append(meanedus, mean(rsample$years_of_education))
}

meanheights
##   [1] 66.70917 67.73385 67.79023 66.14728 67.80466 68.25685 67.24806 67.35680
##   [9] 66.51190 67.15650 68.08892 67.32256 66.85687 68.05746 68.35510 67.20811
##  [17] 67.27213 67.87856 68.08440 67.12507 67.41706 67.81728 66.81242 67.10217
##  [25] 68.00220 67.75950 67.64402 68.30388 68.12805 68.23518 67.58999 66.94302
##  [33] 67.39982 66.55793 68.37569 67.42836 66.95024 67.63060 66.98188 65.41056
##  [41] 67.74095 67.56557 67.97687 69.48090 68.87190 66.52169 66.69663 68.20452
##  [49] 67.70973 67.66911 68.53079 67.80774 68.65878 68.13880 66.06411 66.96215
##  [57] 67.75445 68.36795 67.68582 67.95922 67.04305 68.16526 67.71159 66.70969
##  [65] 66.95962 66.90065 67.27121 67.18884 67.54167 67.57160 67.20071 68.04710
##  [73] 68.00670 67.33173 66.14858 66.81339 69.18375 66.20122 68.53896 67.28485
##  [81] 68.06559 67.20683 67.34291 69.11458 68.93478 66.41522 67.30326 67.11049
##  [89] 68.10434 67.24163 69.13513 67.14035 68.91565 67.61473 66.81637 66.47203
##  [97] 67.91428 67.18752 67.98148 67.55006
pop.sd(meanheights) #SD of the 100 sample means
## [1] 0.7446121
meanweights
##   [1] 140.0907 143.2483 144.0797 137.7401 141.5288 147.6797 142.6640 144.9039
##   [9] 139.5361 148.4602 145.1113 142.5367 139.8036 143.8335 145.5777 143.4344
##  [17] 143.6803 143.5202 145.3062 142.7556 143.5188 145.1753 141.8604 143.3624
##  [25] 146.4219 141.7633 144.9750 145.4961 146.5395 141.5175 142.5447 139.7315
##  [33] 142.6017 139.4531 146.7341 142.9514 145.1526 142.2200 140.6718 136.9862
##  [41] 144.1741 142.9786 145.5574 153.1952 149.0717 141.5394 138.2123 145.5891
##  [49] 143.4246 142.6918 146.3574 144.7914 146.6092 145.2831 137.3378 141.0074
##  [57] 146.7912 148.6044 142.0808 143.9643 142.0499 145.7274 140.4724 138.5629
##  [65] 136.1903 142.8196 143.6204 141.5538 142.7956 142.1601 142.9308 149.0227
##  [73] 144.2475 141.2018 138.8089 139.3910 148.5339 138.3440 147.7413 142.2194
##  [81] 147.4112 140.5679 143.7222 150.1121 151.7919 136.1079 141.8047 140.4299
##  [89] 147.8720 141.9750 150.9282 142.3690 149.0885 145.5944 136.6159 142.7691
##  [97] 143.8571 144.4042 143.2351 142.9427
pop.sd(meanweights)
## [1] 3.399663
meanages
##   [1] 19.73969 19.97911 20.17565 19.38105 20.21627 20.48333 19.99196 19.77483
##   [9] 19.75595 18.73559 20.46096 20.36481 20.24227 20.12593 20.63013 19.04746
##  [17] 19.42419 19.83671 19.88500 19.84707 20.21711 19.25176 19.92363 19.52086
##  [25] 19.81265 20.26436 19.75198 20.18679 20.33098 20.77552 20.42170 19.91804
##  [33] 20.34073 20.02700 20.34340 20.11446 19.40485 20.01376 19.39052 19.00172
##  [41] 20.80079 20.00642 19.98483 20.40055 20.28290 19.30107 20.05877 20.69212
##  [49] 20.28611 20.15783 20.47930 19.84130 20.95639 20.04975 20.04626 19.90198
##  [57] 19.67960 20.30391 20.47013 20.21050 19.28685 19.97429 19.94750 19.96976
##  [65] 20.91439 19.55141 19.76453 19.70146 20.34984 20.50095 19.71627 19.58723
##  [73] 20.60295 19.51820 19.83709 19.99240 20.56904 19.34244 20.59257 19.88068
##  [81] 19.40402 20.35302 19.85417 20.96199 20.34409 19.96717 19.79033 20.04631
##  [89] 20.27410 19.80604 21.15188 19.79293 20.42481 20.51907 19.69919 19.28075
##  [97] 20.56524 19.70941 20.04801 20.60274
pop.sd(meanages)
## [1] 0.4578838
meankills
##   [1] 3.000000 3.100000 2.866667 3.133333 3.000000 3.000000 2.966667 3.000000
##   [9] 3.100000 3.300000 2.166667 2.766667 2.900000 3.600000 2.900000 3.133333
##  [17] 2.900000 3.133333 3.233333 3.000000 2.900000 3.233333 2.500000 3.066667
##  [25] 2.166667 2.700000 2.933333 3.066667 2.966667 2.866667 2.966667 3.233333
##  [33] 3.266667 2.300000 3.033333 3.800000 2.700000 3.200000 2.933333 2.900000
##  [41] 3.133333 3.366667 2.833333 3.300000 3.066667 2.866667 3.133333 3.133333
##  [49] 2.900000 3.366667 3.133333 2.566667 2.900000 2.533333 3.100000 3.000000
##  [57] 3.066667 3.600000 2.766667 3.100000 2.733333 3.066667 3.100000 2.700000
##  [65] 2.866667 2.233333 2.533333 2.866667 3.566667 2.900000 2.733333 3.833333
##  [73] 2.700000 3.400000 3.466667 3.400000 2.933333 2.766667 3.500000 3.400000
##  [81] 3.200000 3.000000 3.133333 2.900000 2.300000 2.933333 2.500000 3.500000
##  [89] 2.600000 2.800000 3.000000 2.266667 3.333333 2.700000 3.233333 3.000000
##  [97] 3.200000 3.066667 2.966667 2.833333
pop.sd(meankills)
## [1] 0.3252929
meanedus
##   [1] 3.133333 2.733333 3.366667 2.433333 3.266667 2.900000 2.833333 3.033333
##   [9] 3.166667 2.833333 3.366667 2.933333 3.233333 3.266667 2.733333 2.266667
##  [17] 3.666667 3.366667 3.366667 3.200000 2.900000 3.066667 3.300000 3.066667
##  [25] 2.266667 2.933333 2.966667 2.933333 2.500000 3.333333 3.100000 3.366667
##  [33] 2.500000 3.000000 3.233333 3.266667 3.066667 2.800000 2.700000 2.933333
##  [41] 3.500000 2.600000 3.666667 3.366667 2.433333 2.866667 3.533333 3.666667
##  [49] 2.866667 3.000000 3.333333 2.833333 3.566667 3.166667 2.933333 2.433333
##  [57] 3.266667 2.900000 2.600000 3.100000 3.266667 2.800000 3.433333 3.033333
##  [65] 3.133333 2.966667 3.100000 2.600000 2.400000 2.600000 2.833333 3.000000
##  [73] 3.000000 3.166667 3.166667 3.066667 2.933333 3.233333 2.800000 2.600000
##  [81] 3.000000 2.633333 3.466667 3.066667 3.066667 2.700000 2.966667 2.633333
##  [89] 3.000000 2.800000 3.533333 3.066667 2.900000 2.600000 3.100000 3.366667
##  [97] 2.766667 3.200000 2.733333 2.733333
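
The mean and standard deviation of each sampling distribution can also be collected in one table. A minimal sketch, assuming a simulated data frame `pop` as a stand-in for `d` (the column values are generated, not the real data) and using `replicate()` in place of the loop:

```r
set.seed(7)   # arbitrary seed so the sketch is reproducible
# Simulated stand-in for d with one normal and one count variable
pop <- data.frame(height = rnorm(1000, mean = 67.6, sd = 4.3),
                  zombies_killed = rpois(1000, lambda = 3))

# 100 samples of 30 rows without replacement; each column of `means` is one sample
means <- replicate(100, colMeans(pop[sample(nrow(pop), 30), ]))

# Mean and SD of each sampling distribution (sd() here divides by 100 - 1)
rbind(mean = rowMeans(means), sd = apply(means, 1, sd))
```

The result has one column per variable, which makes it easy to set the SD of the means next to the standard errors from Question 5.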


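The standard deviations of these sampling distributions are close to the standard errors estimated in Question 5. They are slightly smaller for height (0.74 vs. 0.89), weight (3.40 vs. 3.76), and age (0.46 vs. 0.52), and about the same for zombies killed (0.33 vs. 0.32).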
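
Another point of comparison is the standard error implied by the population standard deviations from Question 1, sigma / sqrt(n) with n = 30. A minimal sketch, with the population SDs typed in by hand from the output above (the vector names are only for this example):

```r
# Population SDs reported in Question 1
pop.sds <- c(height = 4.30797, weight = 18.39186, age = 2.963583,
             kills = 1.747551, edu = 1.675704)

# Expected SD of the mean of n = 30 observations: sigma / sqrt(n)
(theoretical.se <- pop.sds / sqrt(30))
```

These values ignore the small finite-population correction that applies when sampling 30 of 1000 without replacement, so they are only an approximate benchmark.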

par(mfrow = c(1,2))
hist(meanheights) #makes a basic histogram
qqnorm(meanheights, frame = FALSE) #plots a qq graph 
qqline(meanheights) #plots qq line on the graph for analysis

par(mfrow = c(1,2))
hist(meanweights) 
qqnorm(meanweights, frame = FALSE)
qqline(meanweights)

par(mfrow = c(1,2))
hist(meanages)
qqnorm(meanages, frame = FALSE)
qqline(meanages)

par(mfrow = c(1,2))
hist(meankills)
qqnorm(meankills, frame = FALSE)
qqline(meankills) 

par(mfrow = c(1,2))
hist(meanedus)
qqnorm(meanedus, frame = FALSE)
qqline(meanedus)


Based on the histograms and Q-Q plots, the sampling distributions all look closer to normal than the original variables did. However, the mean weights may still not be normally distributed.
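
A formal test can back up the visual impression. The sketch below applies the Shapiro-Wilk test from base R to simulated sample means; `sample.means` is a stand-in for a vector such as `meankills`, and the seed and Poisson rate are arbitrary assumptions.

```r
set.seed(3)   # arbitrary seed so the sketch is reproducible
# 100 means of 30 simulated counts, a stand-in for meankills
sample.means <- replicate(100, mean(rpois(30, lambda = 3)))

# Null hypothesis: the values come from a normal distribution.
# A small p-value would suggest a departure from normality.
(sw <- shapiro.test(sample.means))
```

With only 100 means the test has limited power, so it works best alongside the histograms and Q-Q plots rather than in place of them.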